Comparing Data Streams via Sketching

Authors

  • Emmanuelle Anceaume
  • Yann Busnel
Abstract

We consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately to quickly determine any deviance from nominal behavior. We present a new metric, the Sketch ⋆-metric, which makes it possible to define a distance between updatable summaries (or sketches) of large data streams. An important feature of the Sketch ⋆-metric is that, given a measure on the entire initial data streams, it preserves the axioms of that measure on the sketch (such as non-negativity, identity, symmetry, and the triangle inequality, but also specific properties of the f-divergence or the Bregman divergence). Extensive experiments conducted on both synthetic traces and real data sets allow us to validate the robustness and accuracy of the Sketch ⋆-metric.

Key-words: Data stream; metric; randomized approximation algorithm.

French summary (translated): Sketch ⋆-metric: comparing data streams based on summaries ("sketches"). We study the problem of estimating the distance between arbitrary data streams under limited computation and memory. This problem proves to be very important in monitoring applications where data streams are generated rapidly. Keywords: data streams, randomized approximation algorithm.

* CNRS UMR 6074 IRISA, [email protected], CIDRE
** LINA, Université de Nantes, [email protected], ATLAS-GDD

© IRISA – Campus de Beaulieu – 35042 Rennes Cedex – France – +33 2 99 84 71 00 – www.irisa.fr
hal-00764772, version 1, 13 Dec 2012
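To make the idea of a distance computed on updatable summaries concrete, here is a minimal sketch under stated assumptions. The abstract does not spell out the construction, so everything below is illustrative rather than the authors' algorithm: the class `PartitionSketch`, the BLAKE2-based bucketing, the parameters `t` and `k`, and the choice of total variation distance (one example of an f-divergence) are all assumptions. Each of the `t` rows partitions the items into `k` buckets; since merging items into buckets can never increase total variation, every row gives a lower bound on the exact distance, and the estimate keeps the largest one.

```python
import hashlib
from collections import Counter


def bucket(item, row, k):
    """Map an item to one of k buckets with a row-specific hash (illustrative choice)."""
    digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % k


class PartitionSketch:
    """Hypothetical summary: t rows of k counters, updated one item at a time."""

    def __init__(self, t=4, k=16):
        self.t, self.k = t, k
        self.rows = [[0] * k for _ in range(t)]
        self.n = 0  # number of items seen so far

    def update(self, item):
        for r in range(self.t):
            self.rows[r][bucket(item, r, self.k)] += 1
        self.n += 1

    def distribution(self, row):
        """Empirical distribution over the k buckets of one row."""
        if self.n == 0:
            raise ValueError("empty sketch")
        return [c / self.n for c in self.rows[row]]


def total_variation(p, q):
    """Total variation distance between two aligned probability vectors."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def sketch_distance(a, b, dist=total_variation):
    """Largest per-row distance between two sketches of identical shape."""
    if (a.t, a.k) != (b.t, b.k):
        raise ValueError("sketches must share t and k")
    return max(dist(a.distribution(r), b.distribution(r)) for r in range(a.t))


def exact_distance(stream_a, stream_b, dist=total_variation):
    """Reference value computed from full item frequencies (not small-space)."""
    ca, cb = Counter(stream_a), Counter(stream_b)
    items = list(set(ca) | set(cb))
    na, nb = sum(ca.values()), sum(cb.values())
    return dist([ca[i] / na for i in items], [cb[i] / nb for i in items])


if __name__ == "__main__":
    uniform = [i % 50 for i in range(1000)]
    skewed = [i % 5 for i in range(1000)]
    sa, sb = PartitionSketch(), PartitionSketch()
    for x in uniform:
        sa.update(x)
    for x in skewed:
        sb.update(x)
    print("sketch estimate:", sketch_distance(sa, sb))
    print("exact distance: ", exact_distance(uniform, skewed))
```

In this toy version the memory used per stream is `t * k` counters regardless of stream length, and the estimate inherits non-negativity and symmetry from the underlying distance, which is the kind of axiom preservation the abstract refers to.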


Similar resources

Sketch ⋆-metric: Comparing Data Streams via Sketching (Research Report)

In this paper, we consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately to quickly determine any deviance from nominal behavior. We present a new metric, the S...


Sketch ⋆-metric: Comparing Data Streams via Sketching

In this paper, we consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately to quickly determine any deviance from nominal behavior. We present a new metric, the S...


Corrections to “LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams”

In this article, we describe the corrections to our paper “LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams”, published at IEEE INFOCOM 2014. We also clarify the complexity issue raised by some readers.


Improved Sketching of Hamming Distance with Error Correcting

We address the problem of sketching the Hamming distance of data streams. We present a new sketching technique, fixable sketches, and show that such sketches not only reduce the sketch size but also recover the differences between the streams. Our contribution: for two streams with Hamming distance bounded by k, we show a sketch of size O(k log n) with O(log n) processing time ...


Algorithmic Techniques for Processing Data Streams

We survey some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more advanced procedures that build on those basic ones. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams.




Publication date: 2012